Genomics data is large
Slide on the evolution of sequencing technology versus Moore’s law, to highlight the current data explosion and the difficulties of analysing large-scale data
cf Yan’s slide from 2011
Population genomics
From sample collection and preparation, through sequencing, assembly, variant calling and other data processing steps, to genotype matrices
Making sense of variation data
The Genomic Landscape Underlying Phenotypic Integrity in the Face of Gene Flow in Crows. Poelstra et al. (2014)
Genotype matrices are the raw data; we want to turn them into information, e.g., by running genome scans for selection or differentiation, calculating diversity, etc. Whatever the analysis, the patterns it picks up are manifestations of the evolutionary relationships between individuals and populations.
Segue: relationships can be represented as trees.
Genotype matrices and genealogical trees
Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes. Ralph et al. (2020), Fig. 1
We usually summarize the data as a table of genotypes (a genotype matrix). The sequences are related to one another, and that relationship can be drawn as a tree. By overlaying the mutations on the tree we can regenerate the sequences, as shown here. The figure also hints that the tree is a more efficient representation of the data, as well as being more accurate and carrying more information (demographic events, history, …)
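To make the "tree plus mutations regenerates the sequences" point concrete, here is a minimal plain-Python sketch. The toy tree, the node names and the mutations are all hypothetical (they are not the ones in the Ralph et al. figure), and the sketch assumes at most one mutation per site.

```python
# Hypothetical toy genealogy: child -> parent (None marks the root).
parent = {"a": "x", "b": "x", "c": "y", "x": "y", "y": None}
samples = ["a", "b", "c"]

# Hypothetical variation data: one ancestral allele per site, and one
# mutation per site given as (node carrying the mutation, derived allele).
ancestral = {0: "A", 1: "G", 2: "C"}
mutations = {0: ("x", "T"), 1: ("c", "A"), 2: ("a", "G")}


def lineage(node):
    """Return the node itself plus all of its ancestors up to the root."""
    found = set()
    while node is not None:
        found.add(node)
        node = parent[node]
    return found


def haplotype(sample):
    """Rebuild one sample's sequence from ancestral states and mutations."""
    above = lineage(sample)
    return "".join(
        mutations[site][1] if mutations[site][0] in above else ancestral[site]
        for site in sorted(ancestral)
    )


# One row per sample: this is the genotype matrix, regenerated from the tree.
genotypes = {sample: haplotype(sample) for sample in samples}
```

Three mutations on a five-node tree are enough to recover every row of the matrix, which is the efficiency argument the figure hints at.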
Trees capture biology
cf https://tskit.dev/tutorials/viz.html
segue: problem is that sequences recombine leading to
Segue: need to transition to ARGs somehow; trees are genealogies, but they change with recombination
Recombination modifies gene genealogies
Most organisms recombine a lot!
Miller (2020), Fig. 5.12.3
Ancestral recombination graphs (ARGs)
A “genetic genealogy” tracks genome-wide inheritance paths
Evaluation of Methods for Estimating Coalescence Times Using Ancestral Recombination Graphs. Y. C. Brandt et al. (2022), Fig. 1
Marginal trees along the genome are correlated (cf. the sequentially Markovian coalescent), so a sequence of marginal trees is a good candidate for approximating the ARG. Is there a data structure that does this?
msprime enables simulation of large chromosomes with recombination
Graph representation
[Figure: the ARG drawn as a directed graph on nodes 0 to 12. Edges as drawn: 4->2, 4->3; 5->0, 5->1; 6->0, 6->1; 7->1, 7->4; 8->1, 8->4, 8->5; 9->0, 9->7, 9->8; 10->0, 10->7; 11->0, 11->4, 11->6, 11->7; 12->4, 12->6]
Local tree representation
[Figure: the same ARG drawn as local trees along the genome; x-axis: genome position, with tree boundaries at 0, 30, 58, 133, 232, 568, 743 and 1000]
Tree sequences compress data and speed up analyses
Compact storage (“domain specific compression”)
Fast, efficient analysis (a “succinct” structure)
Well tested, open source (active dev community)
…but limited support for major genomic rearrangements (e.g. inversions, large indels): genomes should be (reasonably) aligned => current primary focus = population genetics
Getting hold of tree sequences
Other programs that don’t output tree sequence format by default: ARGweaver, ARG-Needle
tskit terminology: the basics
[Figure: three local trees covering genome positions 0 to 307, 307 to 567 and 567 to 1000; y-axis: time in generations (ticks at 0, 3, 10, 30, 100, 300, 1000); nodes labelled 0 to 11]
Multiple local trees exist along a genome of fixed length (by convention measured in base pairs)
Genomes exist at specific times, and are represented by nodes (the same node can persist across many local trees)
Some nodes are most recent common ancestors (MRCAs) of other nodes
Entities are zero-based: the first node has id 0, the second id 1, …
tskit terminology: nodes and edges
[Figure: three local trees covering genome positions 0 to 307, 307 to 567 and 567 to 1000; y-axis: time in generations (ticks at 0, 3, 10, 30, 100, 300, 1000); nodes labelled 0 to 11]
Nodes (=genomes)
exist at a specific time
can be flagged as “samples”
can belong to “individuals” (e.g., 2 nodes per individual in humans) and, if useful, “populations”
id  flags  population  individual  time
0   1      0            0            0.00000000
1   1      0            0            0.00000000
2   1      0            1            0.00000000
3   1      0            1            0.00000000
4   1      0            2            0.00000000
5   1      0            2            0.00000000
6   0      0           -1           14.70054184
7   0      0           -1           40.95936939
8   0      0           -1           72.52965866
9   0      0           -1          297.22307150
10  0      0           -1          340.15496436
11  0      0           -1          605.35907657
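The node table can be read as plain records. The sketch below is plain Python rather than the tskit API: the `Node` tuple and variable names are made up for illustration, while the values are copied from the table. It assumes, as in tskit, that bit 0 of `flags` marks a sample and that -1 means "no individual".

```python
from collections import defaultdict, namedtuple

NODE_IS_SAMPLE = 1  # bit 0 of the flags column marks a sample node
NULL = -1           # "no individual assigned"

# Illustrative record type; field order follows the table columns.
Node = namedtuple("Node", "id flags population individual time")

NODES = [
    Node(0, 1, 0, 0, 0.0),
    Node(1, 1, 0, 0, 0.0),
    Node(2, 1, 0, 1, 0.0),
    Node(3, 1, 0, 1, 0.0),
    Node(4, 1, 0, 2, 0.0),
    Node(5, 1, 0, 2, 0.0),
    Node(6, 0, 0, NULL, 14.70054184),
    Node(7, 0, 0, NULL, 40.95936939),
    Node(8, 0, 0, NULL, 72.52965866),
    Node(9, 0, 0, NULL, 297.22307150),
    Node(10, 0, 0, NULL, 340.15496436),
    Node(11, 0, 0, NULL, 605.35907657),
]

# Sample nodes are the ones with the sample flag set.
samples = [node.id for node in NODES if node.flags & NODE_IS_SAMPLE]

# Group the nodes (genomes) by the individual that carries them.
nodes_by_individual = defaultdict(list)
for node in NODES:
    if node.individual != NULL:
        nodes_by_individual[node.individual].append(node.id)
```

Here the six sample nodes pair up into three diploid individuals, while the six older nodes are ancestral genomes with no individual attached.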
Edges
Connect a parent & child
Have a left & right genomic coordinate
Usually span multiple trees (e.g., edges connecting nodes 1+7 and 4+7)
id  left  right  parent  child
0   0     1000   6       2
1   0     1000   6       5
2   0     1000   7       1
3   0     1000   7       4
4   0     1000   8       3
5   0     1000   8       6
6   307   1000   9       0
7   307   1000   9       7
8   0     307    10      0
9   0     567    10      8
10  307   567    10      9
11  0     307    11      7
12  567   1000   11      8
13  567   1000   11      9
14  0     307    11      10
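The edge table is enough to recover every local tree. The following plain-Python sketch (not the tskit API; the function and variable names are illustrative) copies the edges from the table, derives the tree boundaries from the edge coordinates, and builds a child -> parent map for each interval.

```python
# (left, right, parent, child), copied from the edge table.
EDGES = [
    (0, 1000, 6, 2), (0, 1000, 6, 5),
    (0, 1000, 7, 1), (0, 1000, 7, 4),
    (0, 1000, 8, 3), (0, 1000, 8, 6),
    (307, 1000, 9, 0), (307, 1000, 9, 7),
    (0, 307, 10, 0), (0, 567, 10, 8), (307, 567, 10, 9),
    (0, 307, 11, 7), (567, 1000, 11, 8), (567, 1000, 11, 9),
    (0, 307, 11, 10),
]

# Trees change only where some edge starts or ends.
breakpoints = sorted({x for left, right, _, _ in EDGES for x in (left, right)})


def local_tree(start, end):
    """Return {child: parent} using the edges that cover [start, end)."""
    return {
        child: parent
        for left, right, parent, child in EDGES
        if left <= start and end <= right
    }


def root(tree):
    """The root is the only parent that is never a child."""
    (only_root,) = set(tree.values()) - set(tree)
    return only_root


trees = [local_tree(a, b) for a, b in zip(breakpoints, breakpoints[1:])]
roots = [root(tree) for tree in trees]

# Storing each tree separately would need one link per child per tree.
links_if_stored_per_tree = sum(len(tree) for tree in trees)
```

The three trees contain 30 parent-child links between them, but the table needs only 15 rows because edges such as 7->1 and 7->4 are shared by all three trees.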
tskit terminology: sites and mutations
[Figure: the same three local trees (genome positions 0 to 307, 307 to 567 and 567 to 1000; y-axis: time in generations), now with mutations 0 to 9 marked on the branches]
This is how we can encode genetic variation. Most genomic positions do not vary between genomes: usually we don’t bother tracking these.
We can create a site at a given genomic position with a fixed ancestral state.
id  position  ancestral_state
0   52        C
1   200       A
2   335       A
3   354       A
4   474       G
5   523       A
6   774       C
7   796       C
8   957       A
Normally, a site is created in order to place one or more mutations at that site
id  site  node  time          derived_state  parent
0   0     8     247.85988972  T              -1
1   1     0     169.80687857  C              -1
2   2     3      31.84262397  C              -1
3   3     9     326.26095349  C              -1
4   3     7      71.04212649  T               3
5   4     3      42.72352948  C              -1
6   5     7      55.44045835  T              -1
7   6     0     259.82567754  T              -1
8   7     8     169.87040769  G              -1
9   8     0      42.47396523  C              -1
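Together, the edge, site and mutation tables determine every sample's sequence at the variable sites. The sketch below is plain Python rather than the tskit API, with illustrative names; the values are copied from the tables on these slides. It assumes at most one mutation per node at any site, which holds here, and picks the nearest mutation above each sample.

```python
# (left, right, parent, child), copied from the edge table.
EDGES = [
    (0, 1000, 6, 2), (0, 1000, 6, 5), (0, 1000, 7, 1), (0, 1000, 7, 4),
    (0, 1000, 8, 3), (0, 1000, 8, 6), (307, 1000, 9, 0), (307, 1000, 9, 7),
    (0, 307, 10, 0), (0, 567, 10, 8), (307, 567, 10, 9), (0, 307, 11, 7),
    (567, 1000, 11, 8), (567, 1000, 11, 9), (0, 307, 11, 10),
]
# (position, ancestral_state); the list index is the site id.
SITES = [(52, "C"), (200, "A"), (335, "A"), (354, "A"), (474, "G"),
         (523, "A"), (774, "C"), (796, "C"), (957, "A")]
# (site, node, derived_state), in mutation id order.
MUTATIONS = [(0, 8, "T"), (1, 0, "C"), (2, 3, "C"), (3, 9, "C"), (3, 7, "T"),
             (4, 3, "C"), (5, 7, "T"), (6, 0, "T"), (7, 8, "G"), (8, 0, "C")]
SAMPLES = range(6)


def parents_at(position):
    """Return {child: parent} for the local tree covering this position."""
    return {
        child: parent
        for left, right, parent, child in EDGES
        if left <= position < right
    }


def allele(sample, site_id):
    """Walk up from the sample; the first mutation met decides the allele."""
    position, ancestral_state = SITES[site_id]
    parents = parents_at(position)
    derived = {node: state for site, node, state in MUTATIONS if site == site_id}
    node = sample
    while node is not None:
        if node in derived:
            return derived[node]
        node = parents.get(node)  # None once we step above the root
    return ancestral_state


haplotypes = [
    "".join(allele(sample, site_id) for site_id in range(len(SITES)))
    for sample in SAMPLES
]
```

Site 3 (position 354) shows why the mutation table has a `parent` column: the mutation above node 9 changes A to C, and the later mutation above node 7 changes C to T, so samples 1 and 4 carry T, sample 0 carries C, and the rest keep the ancestral A.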
Using tskit
SLiM
tskit and biodiversity
tskit assumes
known ancestral state
phased genomes
and requires fairly large sample sizes to benefit from data compression (\(n>1000\)) and from faster statistical analyses (\(n\) in the hundreds)
…conditions that are not always met for natural populations of non-model organisms
Reasons to use tskit ecosystem for evolution and biodiversity
future-proofing
cheaper and longer read sequencing will require this sort of approach
simulation
simulation software builds on tskit (msprime/SLiM/stdpopsim)
biology
thinking in trees captures the “true” biology (except where structural variation is involved)
statistical power
trees capture genealogical history and variation and potentially have more statistical power than other methods, such as summary statistics
teaching
the biodiversity crowd is very familiar with phylogenetic trees, making the extension to tree sequences a short jump
modelling of complex histories
complex histories with, e.g., hybridization / speciation, will have lots of ILS / conflicting trees, which need to be tackled somehow
alternatives to tsinfer
tsinfer is only one way to infer genealogies but easy to introduce and demonstrate
Application: Evolutionary genomics of the Motacilla alba (white wagtails) radiation
Figure 1: Motacilla alba subspecies; from top left M. yarrellii, M. personata, M. baicalensis, M. alba, M. alboides, M. leucopsis, M. lugens, M. ocularis, and M. subpersonata. Paintings by Bill Zetterström (from Alström & Mild (2003))
Together with Erik Enbody, Tom van der Valk, Leif Andersson, and Per Alström.
Genealogical nearest neighbour chromosome plots